
Joint entity recognition and relation extraction as a multi-head selection problem

Abstract

State-of-the-art models for joint entity recognition and relation extraction strongly rely on external natural language processing (NLP) tools such as POS (part-of-speech) taggers and dependency parsers. Thus, the performance of such joint models depends on the quality of the features obtained from these NLP tools. However, these features are not always accurate for various languages and contexts. In this paper, we propose a joint neural model which performs entity recognition and relation extraction simultaneously, without the need of any manually extracted features or the use of any external tool. Specifically, we model the entity recognition task using a CRF (Conditional Random Fields) layer and the relation extraction task as a multi-head selection problem (i.e., potentially identify multiple relations for each entity). We present an extensive experimental setup, to demonstrate the effectiveness of our method using datasets from various contexts (i.e., news, biomedical, real estate) and languages (i.e., English, Dutch). Our model outperforms the previous neural models that use automatically extracted features, while it performs within a reasonable margin of feature-based neural models, or even beats them.

Data processing pipeline

1 train.txt

#doc 2050
0 Mrs. B-Peop ['N'] [0]
1 Rose I-Peop ['N'] [1]
2 hired O ['N'] [2]
3 Abebe B-Peop ['N'] [3]
4 Worke I-Peop ['Work_For', 'Live_In'] [22, 8]
5 , O ['N'] [5]
6 one O ['N'] [6]
7 of O ['N'] [7]
8 Ethiopia B-Loc ['N'] [8]
9 's O ['N'] [9]
10 most O ['N'] [10]
11 distinguished O ['N'] [11]
12 lawyers O ['N'] [12]
13 and O ['N'] [13]
14 a O ['N'] [14]
15 former O ['N'] [15]
16 member O ['N'] [16]
17 of O ['N'] [17]
18 the O ['N'] [18]
19 country O ['N'] [19]
20 's O ['N'] [20]
21 High B-Org ['N'] [21]
22 Court I-Org ['N'] [22]
23 , O ['N'] [23]
24 to O ['N'] [24]
25 investigate O ['N'] [25]
26 . O ['N'] [26]
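
Each line of train.txt has five columns: the token's index in the sentence, the token itself, its BIO entity tag, the list of relation labels the token participates in, and the list of head token indices (one head per relation; a token with no relation gets the 'N' label and points to itself). Below is a minimal reader sketch, assuming the five columns are tab-separated and noting that the two list columns are Python literals, so ast.literal_eval can parse them; this is an illustration, not the authors' code:

import ast

def read_sentences(path):
    """Yield each sentence as a list of token dicts."""
    sentence = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.rstrip('\n')
            if line.startswith('#doc'):              # document boundary
                if sentence:
                    yield sentence
                    sentence = []
                continue
            if not line.strip():
                continue
            idx, token, tag, rels, heads = line.split('\t')
            sentence.append({
                'index': int(idx),                   # position in the sentence
                'token': token,                      # e.g. 'Worke'
                'bio': tag,                          # e.g. 'I-Peop'
                'relations': ast.literal_eval(rels), # e.g. ['Work_For', 'Live_In']
                'heads': ast.literal_eval(heads),    # e.g. [22, 8]
            })
    if sentence:
        yield sentence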

2 Input data

+--------------------------+---------------------------------------------------------------------------+
| Input data | Value |
+--------------------------+---------------------------------------------------------------------------+
| dropout_embedding_keep | 1 |
| dropout_lstm_keep | 1 |
| dropout_lstm_output_keep | 1 |
| dropout_fcl_ner_keep | 1 |
| dropout_fcl_rel_keep | 1 |
| isTrain | True |
| charIds | [[[35 68 69 9 0 0 0 0 0 0 0 0 0] |
| | [40 65 69 55 0 0 0 0 0 0 0 0 0] |
| | [58 59 68 55 54 0 0 0 0 0 0 0 0] |
| | [23 52 55 52 55 0 0 0 0 0 0 0 0] |
| | [45 65 68 61 55 0 0 0 0 0 0 0 0] |
| | [ 7 0 0 0 0 0 0 0 0 0 0 0 0] |
| | [65 64 55 0 0 0 0 0 0 0 0 0 0] |
| | [65 56 0 0 0 0 0 0 0 0 0 0 0] |
| | [27 70 58 59 65 66 59 51 0 0 0 0 0] |
| | [ 4 69 0 0 0 0 0 0 0 0 0 0 0] |
| | [63 65 69 70 0 0 0 0 0 0 0 0 0] |
| | [54 59 69 70 59 64 57 71 59 69 58 55 54] |
| | [62 51 73 75 55 68 69 0 0 0 0 0 0] |
| | [51 64 54 0 0 0 0 0 0 0 0 0 0] |
| | [51 0 0 0 0 0 0 0 0 0 0 0 0] |
| | [56 65 68 63 55 68 0 0 0 0 0 0 0] |
| | [63 55 63 52 55 68 0 0 0 0 0 0 0] |
| | [65 56 0 0 0 0 0 0 0 0 0 0 0] |
| | [70 58 55 0 0 0 0 0 0 0 0 0 0] |
| | [53 65 71 64 70 68 75 0 0 0 0 0 0] |
| | [ 4 69 0 0 0 0 0 0 0 0 0 0 0] |
| | [30 59 57 58 0 0 0 0 0 0 0 0 0] |
| | [25 65 71 68 70 0 0 0 0 0 0 0 0] |
| | [ 7 0 0 0 0 0 0 0 0 0 0 0 0] |
| | [70 65 0 0 0 0 0 0 0 0 0 0 0] |
| | [59 64 72 55 69 70 59 57 51 70 55 0 0] |
| | [ 9 0 0 0 0 0 0 0 0 0 0 0 0]]] |
| tokensLens | [[ 4 4 5 5 5 1 3 2 8 2 4 13 7 3 1 6 6 2 3 7 2 4 5 1 |
| | 2 11 1]] |
| embeddingIds | [[ 3806 1984 4334 117254 11 5 69 10 7141 29 |
| | 155 3359 5857 13 18 360 402 10 6 207 |
| | 29 219 593 5 16 7011 8]] |
| entity_tags_ids | [[3 7 8 3 7 8 8 8 0 8 8 8 8 8 8 8 8 8 8 8 8 1 5 8 8 8 8]] |
| entity_tags | [['B-Peop' 'I-Peop' 'O' 'B-Peop' 'I-Peop' 'O' 'O' 'O' 'B-Loc' 'O' 'O' 'O' |
| | 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'B-Org' 'I-Org' 'O' 'O' 'O' 'O']] |
| tokens | [['Mrs.' 'Rose' 'hired' 'Abebe' 'Worke' ',' 'one' 'of' 'Ethiopia' "'s" |
| | 'most' 'distinguished' 'lawyers' 'and' 'a' 'former' 'member' 'of' 'the' |
| | 'country' "'s" 'High' 'Court' ',' 'to' 'investigate' '.']] |
| BIO | [['B-Peop' 'I-Peop' 'O' 'B-Peop' 'I-Peop' 'O' 'O' 'O' 'B-Loc' 'O' 'O' 'O' |
| | 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'B-Org' 'I-Org' 'O' 'O' 'O' 'O']] |
| tokenIds | [[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 |
| | 24 25 26]] |
| scoringMatrixGold | [[[0. 0. 0. ... 0. 0. 0.] |
| | [0. 0. 0. ... 0. 0. 0.] |
| | [0. 0. 0. ... 0. 0. 0.] |
| | ... |
| | [0. 0. 0. ... 0. 0. 0.] |
| | [0. 0. 0. ... 0. 0. 0.] |
| | [0. 0. 0. ... 1. 0. 0.]]] |
| seqlen | [27] |
| doc_ids | ['#doc 2050'] |
+--------------------------+---------------------------------------------------------------------------+
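
For illustration, here is a hedged sketch of how charIds and tokensLens in the table could be produced: each character is mapped to an integer id, and every token's id sequence is right-padded with 0 to the length of the longest token in the sentence (13 here, for 'distinguished'). The char2id vocabulary and the '<UNK>' fallback below are assumptions, not the authors' exact code:

import numpy as np

def encode_chars(tokens, char2id, pad_id=0):
    """Return (char_ids, token_lens) for one sentence."""
    token_lens = [len(t) for t in tokens]
    char_ids = np.full((len(tokens), max(token_lens)), pad_id, dtype=np.int32)
    for i, tok in enumerate(tokens):
        for j, ch in enumerate(tok):
            # fall back to a hypothetical '<UNK>' id for unseen characters
            char_ids[i, j] = char2id.get(ch, char2id['<UNK>'])
    return char_ids, np.array(token_lens, dtype=np.int32)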
relations = ['Kill', 'Live_In', 'Located_In', 'N', 'OrgBased_In', 'Work_For']
scoringMatrixGold.shape = (1, 27, 162) #(batch_size, sequence_length, sequence_length * relation_numbers)

From the raw line 4 Worke I-Peop ['Work_For', 'Live_In'] [22, 8] we know that token 4, Worke, participates in two relations: Work_For (relation ID 5) with token 22, and Live_In (relation ID 1) with token 8. If we reshape

scoringMatrixGold.shape = (1, 27, 162)   # (batch_size, sequence_length, sequence_length * relation_numbers)
into
scoringMatrixGold.shape = (1, 27, 27, 6) # (batch_size, sequence_length, sequence_length, relation_numbers)

then for this #doc 2050 example, scoringMatrixGold[0] has shape (27, 27, 6), and token 4 (Worke) corresponds to scoringMatrixGold[0][4], a relation matrix of shape (27, 6). In this matrix the entry at row 8, column 1 and the entry at row 22, column 5 are 1, and all other entries are 0. In this way the relations between entities are encoded as a one-hot matrix. Its full contents are:

[[0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0.]   # row 8, column 1: Live_In (head = Ethiopia)
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1.]   # row 22, column 5: Work_For (head = Court)
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]]

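To make the construction concrete, here is a minimal sketch (an illustration, not the authors' code) that builds this one-hot tensor from the token dicts produced by the reader sketch earlier, then flattens it to the (sequence_length, sequence_length * relation_numbers) layout the model uses. Note that the 'N' rows put their 1 on the diagonal, which is why the last row of the flattened gold matrix above has its 1 at column 26 * 6 + 3 = 159:

import numpy as np

relations = ['Kill', 'Live_In', 'Located_In', 'N', 'OrgBased_In', 'Work_For']
rel2id = {r: i for i, r in enumerate(relations)}

def build_scoring_matrix(sentence, rel2id):
    """sentence: list of token dicts with 'index', 'relations', 'heads' keys."""
    n = len(sentence)
    m = np.zeros((n, n, len(rel2id)), dtype=np.float32)
    for tok in sentence:
        for rel, head in zip(tok['relations'], tok['heads']):
            m[tok['index'], head, rel2id[rel]] = 1.0  # e.g. m[4, 22, 5] and m[4, 8, 1]
    return m

# flatten back to the model's layout:
# m.reshape(n, n * len(relations))  -> (27, 162) for this sentence
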
Note (from the paper):

Since we assume token-based encoding, we consider only the last token of the entity as head of another token, eliminating redundant relations. For instance, there is a Works for relation between entities “John Smith” and “Disease Control Center”. Instead of connecting all tokens of the entities, we connect only “Smith” with “Center”. Also, for the case of no relation, we introduce the “N” label and we predict the token itself as the head.
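
In the paper, each (token, head, relation) triple is scored independently and passed through a sigmoid, and every pair whose score clears a threshold is kept, so one token can end up with several heads at once; that is the "multi-head" selection. Below is a minimal decoding sketch under that reading; the 0.5 threshold and the probs array of shape (sequence_length, sequence_length, relation_numbers) are illustrative assumptions, not the authors' code:

import numpy as np

relations = ['Kill', 'Live_In', 'Located_In', 'N', 'OrgBased_In', 'Work_For']

def decode_heads(probs, relations, threshold=0.5):
    """Collect every (token, head, relation) triple scoring above the threshold."""
    n = probs.shape[0]
    predictions = []
    for i in range(n):                    # token whose heads we are selecting
        for j in range(n):                # candidate head token
            for k, rel in enumerate(relations):
                if rel != 'N' and probs[i, j, k] >= threshold:
                    predictions.append((i, j, rel))
    return predictions

For the gold matrix above, token 4 (Worke) would yield (4, 8, 'Live_In') and (4, 22, 'Work_For').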
